[WIP] Asynchronous corpus updates #46
base: master
Conversation
I guess you could have a dedicated job-dispatching task that does:

```python
while True:
    await asyncio.sleep(...)
    job = await q.get()
    asyncio.ensure_future(job)
```

I guess the tricky bit then would be figuring out when to shut everything down... the problem is you don't know ahead of time how many jobs you'll need. |
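For the shutdown problem, one common pattern is a sentinel value on the queue: the producer enqueues a marker once it knows no more jobs are coming, and the dispatcher drains in-flight work before exiting. A sketch under assumed names (`dispatch`, `work`, and the `None` sentinel are illustrative, not from the PR):

```python
import asyncio

async def dispatch(q):
    # Run jobs as they arrive until a None sentinel signals shutdown.
    running = []
    while True:
        job = await q.get()
        if job is None:
            break
        running.append(asyncio.ensure_future(job))
    # Let any jobs still in flight finish before returning.
    if running:
        await asyncio.gather(*running)

async def main():
    q = asyncio.Queue()
    results = []

    async def work(n):
        await asyncio.sleep(0)   # stands in for real async work
        results.append(n)

    dispatcher = asyncio.ensure_future(dispatch(q))
    for n in range(3):
        await q.put(work(n))
    await q.put(None)            # no more jobs: tell the dispatcher to wind down
    await dispatcher
    return results

print(asyncio.run(main()))
```

The sentinel sidesteps "how many jobs?" entirely: the dispatcher only needs to know when the *producer* is done, not the total count.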
I was thinking something similar. This really seems like there should be a nice, ready-made pattern from somewhere in the concurrent-programming universe… isn't part of a server's job being able to handle the case where you don't know ahead of time how many jobs you'll need? |
@mpacer, happy to explain the different checking steps of
So 1 and 4 are standard checks that could be pretty easily incorporated into an async workflow, whereas 2 and 3 check a different set of files and/or download or compare files that are likely not to be in the temporary directory. |
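To make the per-file framing concrete, here is a hedged sketch of folding the "standard" checks (1 and 4) into a single per-file coroutine so no file waits on a batch barrier. The check names and bodies are placeholders, not the actual `download_check_and_move` steps:

```python
import asyncio

async def check_wellformed(name):
    # Placeholder for "standard check 1" (hypothetical name).
    await asyncio.sleep(0)
    return True

async def check_filename(name):
    # Placeholder for "standard check 4" (hypothetical name).
    await asyncio.sleep(0)
    return bool(name)

async def process_file(name):
    # Per-file pipeline: download, then run the cheap standard checks
    # inline, rather than as separate whole-batch passes.
    await asyncio.sleep(0)   # stands in for the actual download
    checks = [check_wellformed(name), check_filename(name)]
    return all(await asyncio.gather(*checks))

async def update(names):
    # Every file proceeds independently; nothing blocks on the batch.
    return await asyncio.gather(*(process_file(n) for n in names))

print(asyncio.run(update(["a.xml", "b.xml"])))
```

Checks 2 and 3, which compare against files outside the temporary directory, would presumably stay as a separate (possibly batched) stage.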
@mpacer Could you resolve the merge conflicts? |
I'm not too concerned with the merge conflicts as I don't want this to be merged yet. It's still a long way off & I am going to need a lot more time to think about how to approach this. When I push more commits I'll definitely fix the conflicts, but I'll wait until then so that I fix all the merge conflicts all at once. That's generally how I approach slow moving WIP PRs and merge conflicts since it only matters at the point when it's no longer WIP. |
Ok, I didn't know; my criterion was not to open a PR until it was ready to merge.
|
This is a first pass at making the corpus update commands work asynchronously.
There's still a lot I need to do, but this is enough to at least get the ball rolling on how to think about this.

Using `async`/`await` syntax requires Python 3.5+, so I updated the requirements accordingly. Some of the code here originates from the tutorials I was splicing together to accomplish this… so if something seems not to make sense, please point it out so I can address that non-sensemaking-ness.
Notes to individuals
@eseiver
Explaining `plos_corpus.download_check_and_move`
Could you describe explicitly the reasoning behind each of the steps of your `plos_corpus.download_check_and_move` function, from the perspective of working on an individual file? Right now the approach is to process everything as a batch, in steps, but from an async perspective that actually works against using the available cycles efficiently.

Review of `Article.from_bytes`
Also… if I don't use the `BytesIO` wrapper and read the bytes directly, the object created by `et.from_string` has different content than what is currently being saved to disk.

I don't like the `from_bytes` method having the ability to write articles to the corpus, but it seems like there should be some way to do that. I wanted to start with minimal surface area on existing objects, so I just included a single method that does everything I want. Nonetheless, I think there are a lot of API discussions we'll need to have around this.

@njsmith
It would be nice to be able to add tasks from inside the inner loop (`fetch`, based on how the code is structured as of 07ffabc)… That way we could trigger the download of a new corrected article the moment we discover it's needed, rather than in a batch after the fact. Is the only way to add tasks from an inner coroutine to use a global variable, define the inner loop as a closure (so it has access to the higher-scope list), or pass the container in explicitly? It seems like that kind of state management could be handled more cleanly (a global variable, for example, seems like a terrible idea), but I'm not sure how to approach this.
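On the state-management question, the "pass it in" option is a conventional answer: hand the queue itself to the inner coroutine, so it can enqueue the corrected-article download as soon as it's discovered, with no globals or closures. A sketch under assumed names (`fetch`, `worker`, `update`, and the `.orig`/`.corrected` naming are all hypothetical, not the PR's code):

```python
import asyncio

async def fetch(name, q, results):
    # Stands in for the real article download.
    await asyncio.sleep(0)
    results.append(name)
    # Hypothetical "a corrected version exists" test; enqueue the
    # follow-up download immediately instead of batching it for later.
    if name.endswith(".orig"):
        await q.put(name.replace(".orig", ".corrected"))

async def worker(q, results):
    while True:
        name = await q.get()
        try:
            await fetch(name, q, results)
        finally:
            q.task_done()

async def update(names, n_workers=2):
    q = asyncio.Queue()
    results = []
    for name in names:
        q.put_nowait(name)
    workers = [asyncio.ensure_future(worker(q, results)) for _ in range(n_workers)]
    # q.join() returns once every job is done, *including* follow-up
    # jobs enqueued mid-run — which also solves the shutdown question.
    await q.join()
    for w in workers:
        w.cancel()
    return results

print(asyncio.run(update(["a.orig", "b.xml"])))
```

Because the queue's unfinished-task count only reaches zero after the follow-ups complete, `q.join()` doubles as the "we don't know how many jobs ahead of time" shutdown signal.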